Method and System for Virtual Reality (VR) Video Transcode By Extracting Residual From Different Resolutions

ABSTRACT

A VR video transcoding method is disclosed. In the method, source VR video data is decoded to obtain an audio data set and a frame data set type one. The source VR video data and the frame data set type one have a source resolution. A frame data set type two and a frame data set type three are obtained from the frame data set type one. The frame data set type two and the frame data set type three have the same target resolution and are obtained in different manners. An enhancement data set is obtained by subtracting the frame data set type three from the frame data set type two. A base video set is obtained by combining and segmenting the frame data set type two and the audio data set. An enhancement video set is obtained by encoding and segmenting the enhancement data set. The base video set and the enhancement video set are used for VR video playbacks. The base video set and the enhancement video set are transmitted separately and combined with each other into video content to improve transmission efficiency.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the priority and benefit of U.S. provisional application 62/441,936, filed on Jan. 3, 2017, which is incorporated herein by reference in its entirety.

BACKGROUND OF THE DISCLOSURE

Field of the Disclosure

The present disclosure relates to video processing technology, and more particularly, to a method and a system for transcoding a VR video.

Background of the Disclosure

Virtual Reality (VR) is a computer simulation technology for creating and experiencing a virtual world. For example, a three-dimensional real-time image can be presented based on a technology which tracks a user's head, eyes or hands. In the network-based virtual reality technology, full-view video data can be pre-stored on a server, and then transmitted to a display device. A display device can be glasses, a head-mounted display, etc. A video is displayed on the display device in accordance with a viewport of the user.

However, high-resolution video data consumes a large amount of transmission bandwidth and requires high computing power from the display device. Presenting high-resolution VR video over the internet is therefore difficult; existing video streaming technology cannot fully meet the needs of virtual reality.

Therefore, in order to present VR video smoothly in real-time, it is desirable to further improve the existing video streaming technology to save bandwidth and reduce performance requirements for display devices, by a new way to encode and store the VR video data on the server.

SUMMARY OF THE DISCLOSURE

In view of this, the present disclosure relates to a method and a system for video transcoding to solve the above problems.

According to the first aspect of the present disclosure, there is provided a method for transcoding a VR video, which comprises: obtaining an audio data set and a frame data set type one by decoding source VR video data; obtaining a frame data set type two from the frame data set type one; obtaining a frame data set type three from the frame data set type one; obtaining an enhancement data set by subtracting the frame data set type three from the frame data set type two; obtaining a base video set by combining and segmenting the frame data set type two and the audio data set; and obtaining an enhancement video set by encoding and segmenting the enhancement data set. Wherein, the source VR video data and the frame data set type one have a source resolution, the frame data set type two and the frame data set type three have the same target resolution and are obtained in different manners, and the base video set and the enhancement video set are used for VR video playbacks.

Preferably, the source resolution is greater than or equal to the target resolution.

Preferably, the step of obtaining the frame data set type two from the frame data set type one comprises: obtaining the frame data set type two by scaling down the frame data set type one to the target resolution losslessly.

Preferably, the step of obtaining the frame data set type three from the frame data set type one comprises: compressing the frame data set type one by a predetermined video encoding method; obtaining a base video having a basic resolution from the compressed frame data set type one by decreasing resolution; obtaining base video data by decoding the base video; and obtaining the frame data set type three by scaling up the base video data with an interpolation algorithm type one. Wherein, the basic resolution is less than the target resolution.

Preferably, the interpolation algorithm type one is a bilinear interpolation algorithm.

Preferably, the step of obtaining the base video set by combining and segmenting the frame data set type two and the audio data set comprises: combining the frame data set type two and the audio data set into at least one video with sound track; and obtaining the base video set by segmenting the at least one video with sound track in accordance with a timeline.

Preferably, the step of obtaining the enhancement video set by encoding and segmenting the enhancement data set comprises: compressing the enhancement data set by a predetermined video encoding method; and obtaining the enhancement video set by a segmenting process operated on the enhancement data set.

Preferably, the segmenting process is performed in accordance with a timeline and/or in a spatial dimension.

Preferably, the spatial dimension is related to a user's viewport.

According to the second aspect of the present disclosure, there is provided a system for transcoding a VR video, which comprises: a segmentation module, configured to decode source VR video data to obtain an audio data set and a frame data set type one; a first generating module, configured to obtain a frame data set type two from the frame data set type one; a second generating module, configured to obtain a frame data set type three from the frame data set type one; a difference calculation module, configured to subtract the frame data set type three from the frame data set type two to obtain an enhancement data set; a combining and segmenting module, configured to combine and segment the frame data set type two and the audio data set to obtain a base video set; an encoding and segmenting module, configured to encode and segment the enhancement data set to obtain an enhancement video set; and a storage module, configured to store the base video set and the enhancement video set. Wherein, the source VR video data and the frame data set type one have a source resolution, the frame data set type two and the frame data set type three have the same target resolution and are obtained in different manners, and the base video set and the enhancement video set are used for VR video playbacks.

Preferably, the source resolution is greater than or equal to the target resolution.

The present disclosure provides high quality display with better efficiency by transcoding a VR video to the base video and the enhancement video, by storing videos as the base video set and the enhancement video set, by transmitting the base video and the enhancement video separately, and by retrieving a high quality video from the base video and the enhancement video during video playbacks.

BRIEF DESCRIPTION OF THE DRAWINGS

Features and benefits of the present disclosure will become apparent from the attached figures and the following description. The attached figures include:

FIG. 1 is a diagram illustrating an example of the network-based virtual reality playback system;

FIG. 2 is a flowchart showing a method used in the VR playback system of FIG. 1;

FIG. 3 is a flowchart showing a method for transcoding a VR video according to an embodiment of the present disclosure;

FIG. 4 is a flowchart showing detailed steps of obtaining a frame data set type three in FIG. 3;

FIG. 5 is a flowchart showing detailed steps of obtaining a base video set in FIG. 3;

FIG. 6 is a flowchart showing detailed steps of obtaining an enhancement video set in FIG. 3; and

FIG. 7 is a block diagram illustrating a system for transcoding a VR video according to an embodiment of the present disclosure.

DETAILED DESCRIPTION OF THE DISCLOSURE

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. In the drawings, like reference numerals denote like members. For the sake of clarity, the figures are not drawn to scale. Moreover, some well-known parts may not be shown.

FIG. 1 is a diagram illustrating an example network of a VR playback system. The VR playback system 10 includes a server 100 and a display device 120, which are coupled with each other through a network 110, and a VR device 130. For example, the server 100 may be a stand-alone computer server or a server cluster. The server 100 is used to store various video data and to store various applications that process these video data. For example, various daemons run on the server 100 in real time, so as to process various video data in the server 100 and to respond to various requests from the VR device 130 and the display device 120. The network 110 may be one or more selected from the group consisting of the internet, a local area network, an internet of things, and the like. For example, the display device 120 may be any computing device having an independent display screen and a processing capability. The display device 120 may be a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a palmtop computer, a personal digital assistant, a smart phone, an intelligent electrical apparatus, a game console, an iPad/iPhone, a video player, a DVD recorder/player, a television, or a home entertainment system. The display device 120 may store VR player software as a VR player. When the VR player is started, it requests and downloads various video data from the server 100, and renders and plays the video data on the display device 120. In this example, the VR device 130 is a stand-alone head-mounted device that can interact with the display device 120 and the server 100, to communicate the user's current information to the display device 120 and/or the server 100 through signaling. The user's current information is, for example, a viewing angle of the user. According to the information, the display device 120 can flexibly process the currently played video data.
In some embodiments, when the position of the user's helmet changes, the display device 120 determines that the core viewing region for the user has changed and starts to play high-resolution video data in the changed core viewing region.

In the above embodiment, the VR device 130 is a stand-alone head-mounted device. However, those skilled in the art should understand that the VR device 130 is not limited thereto, and the VR device 130 may also be an all-in-one head-mounted device. The all-in-one head-mounted device itself has a display screen, so that it is not necessary to connect the all-in-one head-mounted device with the external display device. For example, in this example, if the all-in-one head-mounted device is used as the VR device, the display device 120 may be eliminated. At this point, the all-in-one head-mounted device is configured to obtain video data from the server 100 and to perform playback operation, and the all-in-one head-mounted device is also configured to detect user's current position modification information and to adjust the playback operation according to the position modification information.

FIG. 2 is a flowchart showing a method used in the VR playback system of FIG. 1. The method includes the following steps.

In step S10, a video data processing procedure is performed on the server side.

In step S20, the display device side obtains the user's position modification information by interacting with the VR device.

In step S30, according to the user's position modification information, the display device side requests the server side to provide the video data and receives the video data.

In step S40, the display device side renders the received video data.

The step S10 is used to process the video data stored on the server side. Unlike a conventional processing method in which source VR video data is directly stored as display data and provided to the display device, the source VR video data is further processed during the video data processing procedure according to the embodiments of the present disclosure. For example, the source VR video data in one video coding format is converted to video data in another video coding format by a video encoding method. Alternatively, the source VR video data with a low resolution is converted to video data with a high resolution, in order to meet the high demands of the display device. Here, the video encoding method is a method for converting a video data file organized in one video coding format to a video data file organized in another video coding format by a specific compression technique. Currently, the most important coding standards for video streaming transmission include H.261, H.263 and H.264, which are set by the International Telecommunication Union.

FIG. 3 is a flowchart showing a method for transcoding a VR video according to an embodiment of the present disclosure. The method shown in FIG. 3 may be used in the above-described video data processing procedure as a preferred embodiment. The method for transcoding the VR video includes following steps specifically.

In step S100, the source VR video data is decoded into an audio data set and a frame data set type one.

The source VR video data contains audio data and full-view video data. For example, the source VR video data contains video data in a horizontal 360-degree and vertical 120-degree viewing angle range. The source VR video data may have an original resolution of 12,600×6,000 pixels. The video data includes image data distributed in a plurality of consecutive frames. The image data of each frame has the original resolution. In this step, the source VR video data is decoded according to the video coding format of the source VR video data, and the audio data of the plurality of frames and the image data of the plurality of frames are extracted. The audio data of the plurality of frames constitutes the audio data set, and the image data of the plurality of frames constitutes the frame data set type one.

In step S200, a frame data set type two is obtained by scaling down the frame data set type one losslessly to a target resolution.

For example, the source VR video data has the original resolution of 12,600×6,000 pixels and the target resolution is 6,300×3,000 pixels. By decreasing the resolution of the source VR video data, the video data with the target resolution can be obtained from the source VR video data with the original resolution. Since the frame data set type one contains video data distributed in the plurality of frames, the obtained frame data set type two also contains data distributed in the plurality of frames.
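
By way of illustration only, the resolution reduction of step S200 can be sketched as follows. The sketch assumes an integer scale factor and simple block averaging on grayscale frames held as nested lists; the disclosure does not name a scaling filter, so the function name `downscale_by_factor` and the averaging filter are assumptions made for this example.

```python
def downscale_by_factor(frame, factor):
    """Shrink a 2-D frame by averaging each factor x factor block of pixels.

    `frame` is a list of rows of integer pixel values. Block averaging is an
    assumed filter; the disclosure does not specify one.
    """
    height, width = len(frame), len(frame[0])
    if height % factor or width % factor:
        raise ValueError("frame dimensions must be multiples of factor")
    out = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            # Collect the pixels of one block and store their integer mean.
            block = [frame[top + dy][left + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


# Halving a 12,600 x 6,000 source in each dimension gives 6,300 x 3,000.
source_size = (12600, 6000)
target_size = (source_size[0] // 2, source_size[1] // 2)

tiny_frame = [[10, 20, 30, 40],
              [10, 20, 30, 40]]
tiny_scaled = downscale_by_factor(tiny_frame, 2)
print(target_size, tiny_scaled)
```

The same factor of 2 relates the example resolutions given above; other factors or filters could be substituted without changing the structure of the step.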

In step S300, the frame data set type one is scaled down during encoding and scaled up during decoding, for being converted into a frame data set type three with the target resolution.

For example, the original resolution of the source VR video data is 12,600×6,000 pixels, so that data of each frame in the frame data set type one has a resolution of 12,600×6,000 pixels. In this step, video data having a basic resolution is first obtained by scaling down the frame data set type one during encoding, and then the frame data set type three is obtained by scaling up the video data to the target resolution during decoding. Both the frame data set type three and the frame data set type two contain video data distributed in the plurality of frames and have the same resolution, equal to the target resolution. However, the frame data set type three and the frame data set type two are obtained in different manners. In addition, in the embodiment, the following relationship is satisfied:

the original resolution ≥ the target resolution ≥ the basic resolution.

For the same display device, the video data with low resolution generates low quality images, and the video data with high resolution generates high quality images.

In step S400, an enhancement data set is obtained by a subtraction between the frame data set type two and the frame data set type three.

In this step, for example, the enhancement data set is obtained by the subtraction between the frame data set type two and the frame data set type three, following the formulas described below.

It is assumed that $P_{x,y}^{Original} = (r, g, b)^{T}$ is a pixel with a coordinate (x, y) in the frame data set type two, wherein r, g, b ∈ [L, H], and $P_{x,y}^{ScaledBase} = (r', g', b')^{T}$ is a pixel with the coordinate (x, y) in the frame data set type three, wherein r′, g′, b′ ∈ [L, H]. Then, for all x and y, the following formula (1) is satisfied:

$\begin{matrix} {P_{x,y}^{NormalizedResidual} = P_{x,y}^{Original} - P_{x,y}^{ScaledBase} + \frac{H - L}{2}} & (1) \end{matrix}$

That is to say, the value of each pixel in the enhancement data set is obtained by a subtraction performed on the r, g, b components of the pixels at the same coordinate in the frame data set type two and the frame data set type three.

However, those skilled in the art should understand that the present disclosure is not limited thereto; it can also be extended to other embodiments using other color space models such as YUV, CMY, HSV and HSI.
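
By way of illustration only, formula (1) can be sketched per color component as follows. The sketch assumes 8-bit components with L = 0 and H = 255 and uses integer division for the offset; the function name `normalized_residual` and these defaults are assumptions made for this example.

```python
def normalized_residual(original, scaled_base, low=0, high=255):
    """Apply formula (1) to every component of every pixel.

    Frames are lists of rows of (r, g, b) tuples. The offset (high - low) / 2
    is computed with integer division, an assumption for integer pixel data.
    """
    offset = (high - low) // 2
    return [
        [tuple(o - s + offset for o, s in zip(pixel_orig, pixel_base))
         for pixel_orig, pixel_base in zip(row_orig, row_base)]
        for row_orig, row_base in zip(original, scaled_base)
    ]


original = [[(200, 100, 50), (0, 0, 0)]]
scaled_base = [[(190, 110, 50), (255, 255, 255)]]
residual = normalized_residual(original, scaled_base)
print(residual)  # [[(137, 117, 127), (-128, -128, -128)]]
```

The second pixel shows that the result of formula (1) can fall outside [L, H] when the two frames differ strongly; the sketch leaves such values unclamped because no handling is specified for them.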

In step S500, a base video set is obtained by combining and segmenting the frame data set type two and the audio data set.

Video-audio data is obtained by combining data in each frame of the frame data set type two and data in each frame of the audio data set, and then the video-audio data is segmented in accordance with a timeline. For example, a time segmenting unit is defined, and the video-audio data with a certain duration is segmented into a plurality of videos with sound track according to the time segmenting unit. For example, a 2-minute time segmenting unit is defined, and the video-audio data with a 10-minute duration is segmented into five 2-minute videos with sound track according to the time segmenting unit.
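
By way of illustration only, the timeline segmentation in the example above can be sketched as follows. The function name `segment_by_time` and the use of whole seconds are assumptions made for this example; a final segment shorter than the unit is allowed when the duration is not a multiple of the unit.

```python
def segment_by_time(total_seconds, unit_seconds):
    """Return (start, end) second ranges of consecutive time segments."""
    if unit_seconds <= 0:
        raise ValueError("unit_seconds must be positive")
    return [(start, min(start + unit_seconds, total_seconds))
            for start in range(0, total_seconds, unit_seconds)]


# A 10-minute video with a 2-minute segmenting unit gives five segments.
segments = segment_by_time(10 * 60, 2 * 60)
print(len(segments), segments)
```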

In step S600, an enhancement video set is obtained by encoding, compressing and segmenting the enhancement data set.

The enhancement data set is segmented after being encoded by a predetermined coding method, wherein the predetermined coding method includes H.264, H.265 and so on. The enhancement video set is obtained by a segmenting process performed on the encoded video data. The segmenting process is performed in accordance with a timeline and/or in a spatial dimension. For convenience, the segmenting process along the timeline is generally based on the same time segmenting unit as that defined in the step S500, while the segmenting process in the spatial dimension may be implemented in a variety of ways. For example, after a space segmenting unit is defined, the video data is segmented uniformly into a collection of sub data according to the space segmenting unit.
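
By way of illustration only, uniform segmentation in the spatial dimension can be sketched as a grid of rectangular tiles. The function name `segment_by_space` and the 2,100×1,500 tile size are assumptions made for this example; the disclosure does not fix the size or shape of the space segmenting unit.

```python
def segment_by_space(width, height, tile_width, tile_height):
    """Return (left, top, right, bottom) rectangles that tile the frame.

    Tiles on the right and bottom edges are cropped to the frame boundary
    when the frame size is not a multiple of the tile size.
    """
    tiles = []
    for top in range(0, height, tile_height):
        for left in range(0, width, tile_width):
            tiles.append((left, top,
                          min(left + tile_width, width),
                          min(top + tile_height, height)))
    return tiles


# A 6,300 x 3,000 frame with a hypothetical 2,100 x 1,500 unit: 3 x 2 tiles.
tiles = segment_by_space(6300, 3000, 2100, 1500)
print(len(tiles), tiles[0], tiles[-1])
```

A player could then request only the tiles that overlap the user's viewport, which is one way the spatial dimension may relate to the viewport.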

FIG. 4 is a flowchart showing detailed steps of obtaining a frame data set type three in FIG. 3.

In step S301, the frame data set type one is compressed and encoded by a predetermined video encoding method.

In this step, each frame of data in the frame data set type one is compressed by a specific video encoding method, which is, for example, based on H.264 or H.265 format.

In step S302, a base video having a basic resolution is obtained from the compressed frame data set type one by decreasing resolution.

By using one of the algorithms including nearest neighbor interpolation algorithm, bilinear interpolation algorithm, cubic convolution algorithm and other algorithms, the base video is obtained according to the frame data set type one, wherein the base video is the video data having the basic resolution.

In step S303, base video data is obtained by decoding the base video having the basic resolution.

The base video is decoded by a decoding method corresponding to the step S301, in order to generate a frame data set having the basic resolution.

In step S304, a frame data set type three is obtained from the base video data by an interpolation algorithm.

By using one or more of the algorithms including nearest neighbor interpolation algorithm, bilinear interpolation algorithm, cubic convolution algorithm and other algorithms, video data having the target resolution is obtained based on the base video having the basic resolution, that is, the frame data set type three is obtained. The target resolution is equal to or higher than the basic resolution.
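
By way of illustration only, the scaling up of step S304 with a bilinear interpolation algorithm can be sketched as follows for a single-channel frame. The function name `bilinear_upscale` and the corner-aligned sampling convention are assumptions made for this example; other alignment conventions give slightly different values.

```python
def bilinear_upscale(frame, out_width, out_height):
    """Resize a 2-D single-channel frame with bilinear interpolation.

    Output corners are mapped onto input corners (an assumed convention).
    """
    in_height, in_width = len(frame), len(frame[0])
    out = []
    for y in range(out_height):
        # Fractional source row for this output row.
        fy = y * (in_height - 1) / (out_height - 1) if out_height > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_height - 1)
        wy = fy - y0
        row = []
        for x in range(out_width):
            # Fractional source column for this output column.
            fx = x * (in_width - 1) / (out_width - 1) if out_width > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_width - 1)
            wx = fx - x0
            # Blend horizontally on the two neighboring rows, then vertically.
            top = frame[y0][x0] * (1 - wx) + frame[y0][x1] * wx
            bottom = frame[y1][x0] * (1 - wx) + frame[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


base = [[0, 100],
        [100, 200]]
upscaled = bilinear_upscale(base, 3, 3)
print(upscaled)
```

For a color frame the same function could be applied to each component separately.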

FIG. 5 is a flowchart showing detailed steps of obtaining a base video set in FIG. 3.

In step S501, at least one video with sound track is obtained by combining the frame data set type two and the audio data set.

In this step, at least one complete video with sound track is obtained by recombining the frame data set type two and the audio data set. The at least one video with sound track has the target resolution. When there is more than one video with sound track, the videos may have the same duration or different durations.

In step S502, the at least one video with sound track is segmented into a base video set.

The at least one video with sound track is segmented in accordance with a timeline. The at least one video with sound track is segmented into a plurality of sub pieces with a same duration or different durations, and the plurality of sub pieces constitute the base video set.

FIG. 6 is a flowchart showing detailed steps of obtaining an enhancement video set in FIG. 3.

In step S601, the enhancement data set is compressed and encoded by a predetermined video encoding method.

In this step, each frame of data in the enhancement data set is compressed by a specific video encoding method, which is, for example, based on H.264 or H.265 format.

In step S602, the encoded enhancement data set is segmented to generate an enhancement video set.

In this step, the enhancement video set is obtained by a segmenting process that is the same as the segmenting process in the step S500.

The segmenting process is performed in accordance with a timeline and/or in a spatial dimension. For convenience, the segmenting process along the timeline is generally based on the same time segmenting unit as that defined in the step S500, while the segmenting process in the spatial dimension may be implemented in a variety of ways. For example, a space segmenting unit is defined, and the video data is segmented uniformly into a collection of sub data according to the space segmenting unit.

The method for transcoding the VR video according to the embodiments of the present disclosure processes VR video data into the base video set and the enhancement video set. During playback, the enhancement video set is superposed with the base video set, in order to implement high quality display. The base video set and the enhancement video set are transmitted separately to improve transmission efficiency.
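
By way of illustration only, rearranging formula (1) gives the superposition performed during playback. This assumes that the enhancement data is decoded without loss and that the scaled-up base pixel is available to the player; the superscript Reconstructed is a label introduced for this example only.

$P_{x,y}^{Reconstructed} = P_{x,y}^{NormalizedResidual} + P_{x,y}^{ScaledBase} - \frac{H - L}{2}$

Under these assumptions the reconstructed pixel equals $P_{x,y}^{Original}$. With lossy encoding of the enhancement data the equality holds only approximately.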

FIG. 7 is a block diagram illustrating a system for transcoding a VR video according to an embodiment of the present disclosure. The system for transcoding the VR video comprises a segmentation module 701, a first generating module 702, a second generating module 703, a difference calculation module 704, a combining and segmenting module 705, an encoding and segmenting module 706 and a storage module 707.

The segmentation module 701 is configured to decode source VR video data to obtain an audio data set and a frame data set type one, wherein the source VR video data and the frame data set type one both have a source resolution.

The first generating module 702 is configured to obtain a frame data set type two from the frame data set type one.

The second generating module 703 is configured to obtain a frame data set type three from the frame data set type one. The frame data set type two and the frame data set type three have the same target resolution and are obtained in different manners. For example, the frame data set type two is obtained by decreasing the resolution of the frame data set type one to the target resolution, and the frame data set type three is obtained by scaling down the frame data set type one to a basic resolution during encoding and scaling the result up to the target resolution during decoding. The frame data sets type one, type two and type three each contain data distributed in a plurality of frames.

The difference calculation module 704 is configured to subtract the frame data set type three from the frame data set type two to obtain an enhancement data set. The enhancement data set is obtained by subtracting the frame data set type three generated by the second generating module 703 from the frame data set type two generated by the first generating module 702.

The combining and segmenting module 705 is configured to combine and segment the frame data set type two and the audio data set to obtain a base video set.

The encoding and segmenting module 706 is configured to encode and segment the enhancement data set to obtain an enhancement video set.

The storage module 707 is configured to store the base video set and the enhancement video set. The base video set and the enhancement video set are used for VR video playbacks. During the playback, the enhancement video set is superposed with the base video set, in order to implement high quality display. Preferably, the source resolution is greater than or equal to the target resolution.

The system for transcoding the VR video according to the embodiments of the present disclosure is configured to process VR video data into the base video set and the enhancement video set, and to transmit the base video set and the enhancement video set in order to implement high quality transmission. At the same time, the playback effect is not degraded and may even be improved.
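
By way of illustration only, the data flow between the modules 701 to 707 can be sketched as a pipeline of interchangeable stages. The class name `TranscodePipeline`, its field names and the toy stages in the usage part are all assumptions made for this example; real stages would wrap an actual decoder, scaler and encoder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranscodePipeline:
    """Wires together placeholder stages that mirror modules 701 to 707."""
    decode: Callable              # 701: source -> (audio, frames type one)
    scale_down: Callable          # 702: frames type one -> frames type two
    make_scaled_base: Callable    # 703: frames type one -> frames type three
    subtract: Callable            # 704: (type two, type three) -> enhancement data
    mux_and_segment: Callable     # 705: (type two, audio) -> base video set
    encode_and_segment: Callable  # 706: enhancement data -> enhancement video set
    store: Callable               # 707: keeps both sets for playback

    def run(self, source):
        audio, frames_one = self.decode(source)
        frames_two = self.scale_down(frames_one)
        frames_three = self.make_scaled_base(frames_one)
        enhancement = self.subtract(frames_two, frames_three)
        base_set = self.mux_and_segment(frames_two, audio)
        enhancement_set = self.encode_and_segment(enhancement)
        self.store(base_set, enhancement_set)
        return base_set, enhancement_set


# Toy stages operating on integers instead of real frames (illustrative only).
stored = {}
pipeline = TranscodePipeline(
    decode=lambda src: (src["audio"], src["frames"]),
    scale_down=lambda frames: [f // 2 for f in frames],
    make_scaled_base=lambda frames: [f // 2 - 1 for f in frames],
    subtract=lambda two, three: [a - b for a, b in zip(two, three)],
    mux_and_segment=lambda frames, audio: [(frames, audio)],
    encode_and_segment=lambda enhancement: [enhancement],
    store=lambda base, enh: stored.update(base=base, enhancement=enh),
)
base_set, enhancement_set = pipeline.run({"audio": "a", "frames": [10, 20]})
print(base_set, enhancement_set)
```

Keeping each stage as a separate callable follows the module boundaries of FIG. 7, so that any one stage can be replaced without affecting the others.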

The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical applications, to thereby enable others skilled in the art to best utilize the disclosure. The disclosure is intended to cover alternatives, modifications and equivalents that may be included within the spirit and scope of the disclosure as defined by the appended claims.

The foregoing descriptions of specific embodiments of the present disclosure have been presented, but are not intended to limit the disclosure to the precise forms disclosed. It will be readily apparent to one skilled in the art that many modifications and changes may be made in the present disclosure. Any modifications, equivalents and variations of the preferred embodiments can be made without departing from the doctrine and spirit of the present disclosure.

1. A method for transcoding a VR video, comprising: obtaining an audio data set and a frame data set type one by decoding source VR video data, wherein the source VR video data and the frame data set type one have a source resolution; obtaining a frame data set type two from the frame data set type one; obtaining a frame data set type three from the frame data set type one, wherein the frame data set type two and the frame data set type three have the same target resolution and are obtained by different manners; obtaining an enhancement data set by subtracting the frame data set type three from the frame data set type two; obtaining a base video set by combining and segmenting the frame data set type two and the audio data set; and obtaining an enhancement video set by encoding and segmenting the enhancement data set, wherein the base video set and the enhancement video set are used for VR video playbacks.
 2. The method according to claim 1, wherein the source resolution is equal to or higher than the target resolution.
 3. The method according to claim 2, wherein the step of obtaining the frame data set type two from the frame data set type one comprises: obtaining the frame data set type two by scaling down the frame data set type one losslessly to the target resolution.
 4. The method according to claim 2, wherein the step of obtaining the frame data set type three from the frame data set type one comprises: compressing the frame data set type one by a predetermined video encoding method; obtaining a base video having a basic resolution from the compressed frame data set type one by decreasing resolution, wherein the basic resolution is less than the target resolution; obtaining base video data by decoding the base video; and obtaining the frame data set type three by scaling up the base video data with an interpolation algorithm type one.
 5. The method according to claim 4, wherein the interpolation algorithm type one is a bilinear interpolation algorithm.
 6. The method according to claim 1, wherein the step of obtaining the base video set by combining and segmenting the frame data set type two and the audio data set comprises: combining the frame data set type two and the audio data set into at least one video with sound track; and obtaining the base video set by segmenting the at least one video with sound track in accordance with a timeline.
 7. The method according to claim 1, wherein the step of obtaining the enhancement video set by encoding and segmenting the enhancement data set comprises: compressing the enhancement data set by a predetermined video encoding method; and obtaining the enhancement video set by a segmenting process operated on the enhancement data set.
 8. The method according to claim 7, wherein the segmenting process is performed in accordance with a timeline and/or in a spatial dimension.
 9. The method according to claim 8, wherein the spatial dimension is related to a user's viewport.
 10. A system for transcoding a VR video, comprising: a segmentation module, configured to decode source VR video data to obtain an audio data set and a frame data set type one, wherein the source VR video data and the frame data set type one have a source resolution; a first generating module, configured to obtain a frame data set type two from the frame data set type one; a second generating module, configured to obtain a frame data set type three from the frame data set type one, wherein the frame data set type two and the frame data set type three have the same target resolution and are obtained by different manners; a difference calculation module, configured to subtract the frame data set type three from the frame data set type two to obtain an enhancement data set; a combining and segmenting module, configured to combine and segment the frame data set type two and the audio data set to obtain a base video set; an encoding and segmenting module, configured to encode and segment the enhancement data set to obtain an enhancement video set, wherein the base video set and the enhancement video set are used for VR video playbacks; and a storage module, configured to store the base video set and the enhancement video set. 